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Abstract. We introduce indexing of tables referencing complex structures such as digraphs 
and spatial objects, appearing in genetics and other data intensive analysis. The indexing is 
achieved by extracting dimension schemas from the referenced structures. The schemas and 
their dimensionality are determined by proper coloring algorithms and the duality between 
all such schemas and all such possible proper colorings is established. This duality, in turn, 
provides us with an extensive library of solutions when addressing indexing questions. It is illus- 
trated how to use the schemas, in connection with additional relational database technologies, 
to optimize queries conditioned on the structural information being referenced. Comparisons 
using bitmap indexing in the Oracle 9.2i database, on the one hand, and multidimensional 
clustering in DB2 8.1.2, on the other hand, are used to illustrate the applicability of the in- 
dexing to different technology settings. Finally, we illustrate how the indexing can be used 
to extract low dimensional schemas from a binary interval tree in order to resolve efficiently 
interval and stabbing queries. 

Keywords: Relational databases, indexing, intersection graphs, proper graph coloring, di- 
mensionality, data warehousing, dimension tables, clique structures, genetics digraphs, gene 
ontology, spatial indexing, interval queries, multidimensional clustering, bitmap indexing, hy- 
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1 Introduction 

Researchers at the Whitehead Institute/MIT 
Center for Genome Research, use flat file formats 
when working on whole genome assembly, accord- 
ing to a conversation with David Jaffe the group 
leader. Is this the rule or an exception? It still 
appears that flat file formats are currently used 
in many, if not most, computationally intensive 
biology related research. To us this represents a 
challenge: To come up with relational database 
technologies which allow efficient analysis of com- 
plex structures, e.g., from biology, using the re- 
lational database environment. 

The approach that we are led to in this pa- 
per involves generalizing existing ROLAP (Re- 
lational OLAP) methods. The generalization is 



achieved by applying proper graph colorings in 
order to extract dimension schemas from com- 
plex structures. A rather cryptic formula, see the 
corollary in section 3, dictates that this general- 
ization, when applied to many acyclic digraphs, 
will return schemas of low dimensionality. In 
turn, this low dimensionality suggests that the 
database - and, in particular, the business intelli- 
gence industry is well placed to provide new tools 
incorporating the generalized methods. 

Indexing of tables referencing graphs and 
other complex structures is studied in this paper. 
For digraphs the setting is as follows. Start with 
a table in a relational database. Assume that one 
of the columns in the table contains nodes from a 
directed graph. We consider the task of indexing 
or organizing the table in such a way that queries 



based on the information provided by the edges 
of the digraph are optimally evaluated. 

The indexing introduced differs from many 
other approaches in that it is not complete. 
In our case this means that only a data entry 
schema is materialized, as a table, and then ad- 
ditional mechanisms are borrowed from the re- 
lational database system. How exactly the data 
entry schemas, called clique schemas, are used 
to optimize queries varies, it is determined by 
the database optimizer. To complete the index- 
ing the additional relational mechanisms used 
here, for illustration purposes, are static bitmap 
(join) indexes and multidimensional clustering. 
This leads us to comparing using clique schemas 
in connection with multidimensional clustering in 
DB2, see [12], and using the schemas in connec- 
tion with static bitmap join indexing in Oracle, 
see [11]. The comparison reveals that the clique 
indexing schema can be used efficiently in connec- 
tion with various relational data access methods. 

It is worth pointing out that in the paper "An 
Array-Based Algorithm for Simultaneous Mul- 
tidimensional Aggregates" by Zhao, Deshpande 
and Naughton ([5]), clustered arrays and non- 
clustered tables are compared with respect to the 
CUBE operator. Our setup is somewhat similar 
but our results do not necessarily favor using mul- 
tidimensional structures as was observed for the 
CUBE operator in [5]. In fact, the Oracle setup, 
combining clique schemas and bitmap join index- 
ing, turns out to be much more efficient when an- 
alyzing data residing in memory than using mul- 
tidimensional clustering in DB2. 

Additional examples are also introduced. This 
includes clique indexing schemas for various set 
collections other than digraphs, such as struc- 
tures derived from sets of intervals or the binary 
interval tree. In the latter, the clique schemas 
may be used to answer efficiently stabbing and in- 
terval queries. The examples show that the tech- 
nology can be used to address a large class of 
problems involving complex structures. 



1.1 Chronology 

Section 3 contains a discussion about dimension- 
ality of graph structures. In section 4 we derive 
the clique indexing schema using proper graph 
coloring. The schema is then put in a context in 
which it provides an optimal solution to deter- 
mining data entries for indexing tables referenc- 
ing sets determined by a set-valued function. The 
indexing is tested in section 5 in connection with 
other relational database techniques. Additional 
applications are provided for illustration in sec- 
tion 6. In each of the additional applications we 
use the clique schema to implement spatial in- 
dexing, first we use a set-valued function which 
depends on the underlying spatial data and then 
we model a fixed interval tree instead. 



2 Related Work 

Indexing techniques for relational databases is a 
large field of study. An overview of indexing of 
semistructured data, e.g., acyclic digraphs and 
XML data, may be found in [10]. A particular in- 
dexing technique for digraphs based on encoding 
paths as strings is introduced in [7] . In [6] the au- 
thors, of that paper, describe an analysis frame- 
work for tree-structured balanced access meth- 
ods for evaluating performance, based on "per- 
formance loss metrics" and assumed optimal tree 
structures. Stabbing and interval queries may be 
resolved using various trees and structures and 
recently the relational interval tree has been in- 
troduces, see [8] for the Rl-tree and further ref- 
erences. 

Several relevant graph coloring results, exam- 
ples and observations regarding down-coloring of 
acyclic digraphs are obtained in [1] and [2]. In 
particular, estimates are obtained for the down- 
chromatic numbers of acyclic digraphs in [2] and 
performance factors established for the down- 
coloring of acyclic digraphs. An accessible intro- 
duction to graph coloring is found in [13]. 
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3 Dimensionality 

Today's decision support, also know as business 
intelligence, industry thrives on promoting tech- 
nology made possible, to a large degree, by the 
low dimensionality of digraphs describing busi- 
ness processes. For example, a typical location 
hierarchy has 4 levels i.e., store, city, state and 
country. The time dimension can likewise, usu- 
ally, be colored with few colors and so on. For sci- 
ence related applications the structures are more 
complicated, but we still may want to be able to 
take advantage of the efficiency (the online part 
in OLAP or online-analytical-processing) pro- 
vided by star-schema etc. like setups, made pos- 
sible by low chromatic numbers. Dimensionality 
in OLAP setups generalizes to down- chromatic 
numbers, defined in the papers [1] and [2]. We 
will explain this further, but first we review an 
important, although cryptic, formula from [2]. It 
is important because it bounds the dimension- 
ality of indexing structures required for acyclic 
digraphs. Just as in today's business intelligence 
systems, low dimensionality allows one to materi- 
alize the indexing (clique) schemas as dimension 
tables. 

In [2] the following is proved about the down- 
chromatic number of an acyclic digraph 

Corollary 1 (from [2]). If (? is an acyclic di- 
graph, then its down- chromatic number satisfies 
the following: 

1. Ifmd(H) = 1 or D0) = 2, then 

2. Ifind(H) > 1 and D0) > 2, then 
Xd0)<md(H)(D0)-2) + l. 

Where H = H-g is the down-hypergraph of (? . 
Moreover, the mentioned upper bounds in both 
cases are sufficient for greedy down-coloring of 

Before explaining the relevance of the corollary 
further we review the terminology used: 



— The down- chromatic number Xd{^) is the 
least number of "colors" that can be applied 
to the graph nodes so that any two nodes 
which have a common ancestor always receive 
different colors. 

— The down-hypergraph H is obtained from 
G by taking as its edges the maximal 
descendants- and-self sets D[u}. More explic- 
itly, D[u] contains u and all nodes that can 
be traversed starting from u in the digraph. 
Also, hypergraphs are allowed to have edges 
formed by connecting multiple nodes instead 
of just 1 or 2 nodes per edge for graphs. 

— The number ind(£f) is called the degeneracy 
or the inductiveness of the down-hypergraph. 
It is defined as 

md(H)=mzx S cv(H){S(H{S})} 

= max S cy(H) {minxes {d H[s] (u)}} 

where V(H) is the set of graph nodes (from 
(j) and H[S\ is the subhypergraph of H 
induced by S. The subhypergraph contains 
nodes from S and its edges are obtained 
as the (distinct) restrictions of maximal 
descendants- and-self sets D[u] to S and addi- 
tionally by requiring each edge, in H [S] , to be 
formed by at least two nodes. Finally, to com- 
plete the description of ind(TJ) we point out 
that the minimum degree of H[S], denoted 
by 8(H[S]), represents the absolute minimum 
count of edges from H[S] that contain the 
same element from S. 

— The number D(Cl) is the maximum number 
of nodes in a descendants-and-self set. 

The relevance of the corollary here is that 
the down-chromatic number Xd{G) estimated in 
the corollary represents the number of dimen- 
sions needed to "index" a table referencing the di- 
graph C*. The indexing is used to resolve queries 
conditioned on unions, intersections and compli- 
ments of ancestors-and-self sets derived from the 
digraph Tt . Eventually, we will extend our ap- 
proach to any set-valued mapping. We refer the 
reader to [2] for a more detailed discussion of 
these numbers. A high level discussion follows. 
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When considering the above formulas it is 
sometimes useful to reverse the arrows in an 
acyclic digraph so that the nodes converge at a 
"root" element, if such an element exists. Then, 
if the arrows are reversed, D{(x) is the number 
of nodes in the largest ancestors-and-self set in 
the original digraph and the down-coloring pro- 
duces an indexing schema allowing us to optimize 
queries referencing descendants-and-self sets. In 
this case, the schemas may therefore be consid- 
ered to be generalizations of dimension tables 
found in star-schema setups commonly used in 
relational databases. 

Most importantly, for many applications, in- 
cluding (reversed) tree digraphs, the degeneracy 
of the down-hypergraph is one, i.e., ind(if) = 1. 
Therefore, in these cases, only the first part of the 
corollary needs to be evaluated and the dimen- 
sionality or down-chromatic number, i.e., Xd(<?), 
is readily determined^ in the simplest way one 
can hope for, as D{G). Sometimes, such as for 
graphs similar to (reversed) genealogy digraphs, 
D(Ct) may grow fast, but eventually lack of in- 
formation and inbreeding will bound the largest 
ancestor set. On the other hand, polygamy can 
result in higher degeneracy numbers, requiring 
us to use the estimate in part two of the corol- 
lary. We finish this section with an example of 
a small "artificial" digraph C? with 'md(H) > 1 
and D(C!) > 2. The acyclic digraph below has 
md(H) = 3, £>(<?) = 3 and X d0) = 4. 

12 

/ \ 
1 2 

/ \ / \ 



\ / \ / 

4 3 

\ / 
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The digraph is defined so that each node "AB" is 
the source node for edges with target nodes "A" 



and "B". It can be down-colored with 4 colors. 
Other similar examples can be found in [2]. 

In the following sections we explain further 
how to extract schemas, similar to dimension ta- 
bles, from complex structures, such as digraphs 
or spatial maps. This will enable us to index ta- 
bles referencing the structures and take advan- 
tage of access plans developed for data warehous- 
ing. 

4 Tables Referencing Complex 
Structures 

In this section we create a simple theoretical 
framework for studying the indexing. In particu- 
lar, the duality between proper colorings and in- 
dexing of tables referencing complex structures is 
established. This duality allows us to take advan- 
tage of results known about graph coloring when 
creating indexing schemas and when maintain- 
ing the schemas during structure updates. Proper 
graph coloring has been studied extensively and 
is used to address numerous applied problems 
and therefore this connection provides us with 
a library of solutions when addressing indexing 
questions. In the previous section, we already 
showed how a particular coloring result is trans- 
lated into information about indexing schemas, 
providing the reader with a motivating example. 

4.1 Schema Representations 

Data entries in an index file are values that point 
to records in a data file, in our case to rows in a 
table referencing a domain V. It is assumed that 
each row contains a node, in a fixed column, from 
the domain V. In order to index the table, one 
needs to develop a data entry schema compati- 
ble with the queries that the index will resolve 
and a translation mechanism that translates the 
queries into operations carried out with the aid of 
the data entries, or more formally: Let the data 
entries be denoted by ei*, . . . , e„*, each data en- 
try points to rows in the table and these rows do 
in turn determine a set of nodes in V which we 
denote by F(ei*), the function F is referred to as 
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being set- valued. The data entry schema may be 
realized as a collection of functions {/} defined 
on subsets of V and with range in the data entries 
{ei*, . . . , e„*} and such that for a given data en- 
try e* there should be at least one function /, 
in the schema, satisfying F(e*) = / _1 (e*). That 
is, / assigns the data entry e* only to search key 
nodes from the set F(e*). Therefore, a complete 
schema is realized as a collection /i, / 2 , . . . , fk of 
functions so that for each data entry e* there ex- 
ists at least one integer c(e*) with 

For two distinct data entries e* and e'* we 
have that if the intersection F(e*) CI F(e'*) 
is nonempty, say contains a node w, then 
fc(e*){w) = e * and f c { e '*){w) = e'*, so since 
e* 7^ e'*, we must have c(e*) ^ c(e'*). It follows 
that the map e* c(e*) is a proper coloring of 
the intersection graph, Int(F), of the set- valued 
map F. To clarify this, we note that the intersec- 
tion graph has nodes 

V(hA(F)) = {e 1 *,...,e n *} 

and edges 

E(lnt(F)) = {e^ <-> ej * : e,* ^ ej* 

and F(ei*) H F(ej*) + 0}. 

Consequently, the number of functions, k, needed 
to realize the data entry schema is greater than or 
equal to the chromatic number, x(Int(F)), of the 
intersection graph. Keeping the number of func- 
tions close to this limit is important, since there 
are considerable performance and bounding costs 
associated with each additional function, as will 
become apparent in the next sections. 

Before continuing, it is worth pointing out the 
duality between proper colorings of the lrtt(F) 
graph and possible data entry realizations: Given 
a proper coloring c : V(lnt(F)) — > {l,...,fc} 
of the intersection graph, one constructs a com- 
plete schema for assigning data entries by defin- 
ing functions /i , . . . , /k as 

fi(u) = e*, if u € F(e*) and c(e*) = i. (1) 



The definition of Int(F) ensures that the func- 
tions are well defined. The duality between all 
the proper colorings and all the possible index- 
ing schemas has therefore been established. 

4.2 Generalized Dimension Tables 

We sometimes create the data entry schema 
without a referencing tabic. This is possible if 
the structure of the (maximal) set-valued map 
F is known beforehand, as is often the case 
when tables reference fixed digraphs. The schema 
fi , . . . , fk may be materialized in many differ- 
ent ways. Now we illustrate how to use data en- 
try schemas to build tables, similar, in function, 
to dimension tables used in data warehousing. 
These tables are called clique tables or schemas. 

As an illustration, consider a table referenc- 
ing a simple digraph £? and the task (again - see 
section 3 above), of indexing the table in such a 
way that queries conditioned on one or more of 
the descendants-and-self sets of the digraph may 
be efficiently evaluated. In this case it is straight- 
forward to pick as data entries the digraph nodes 
V{Gr) and have each data entry e* point to all 
rows in the table that reference e* or one of its 
descendants, i.e., D[e*]. The practicality of this 
approach depends largely on the chromatic num- 
ber of the intersection graph, Int (£>[•]), and the 
efficiency of the coloring algorithm select to prop- 
erly color the intersection graph. 

For digraphs that induce relatively small, e.g., 
less than 1000, chromatic numbers of Int (£>[•]), 
we devise a structure so that the set operations 
(U, n, \) can be optimally executed on elements 
from the set collection {D[u\ : u E V(C?)}. This 
is done by having the data entry schema func- 
tions, see formula (1) above, physically share the 
domain V((*) in the database, such as by mate- 
rializing the relation 

{(u,f 1 (u),...,f k (u)):ueV0)} 

in the database system, in addition to the col- 
oring map c. Following standard convention, this 
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requires that fj(u) — "NULL" if u is not in the 
domain of fj. We will refer to this relation in 
the below as Clique (£>[•]), it has ordered column 
headings: "node" , "cl" , . . . , "ck" . The data en- 
try schema is referred to as the clique indexing 
schema. In general, we let 

Cliquc(F) = ["node", "cl", "ck"] 

= {(u,f 1 (u),...,f k (u)):ueV}, 

representing a proper coloring of the intersection 
graph Int(F) and, at the same time, the schema 
/i, . . . , fk as determined by formula (1). 

In some cases, such as when working with ge- 
nealogy records that stretch many generations 
back, the chromatic number of the intersection 
graph may become too large for the clique schema 
to be formed directly. In these cases, one tries to 
invent a more practical set-valued function, and 
use it to build the clique schema, instead of using 
the descendants-and-self mapping directly. This 
will become apparent in the following sections 
in connection with the discussion of the interval 
graph. 

5 Multidimensional Clustering or 
Bitmap Join Indexing? 

In this section we compare different relational 
database technologies required to complete the 
indexing. We apply the indexing to a test setup 
involving a table referencing a digraph origi- 
nating from genetics or drug related research. 
The test setup involves queries resolved using 
the Cliquc(D[-]) indexing schema, created for the 
Gene Ontology (denoted by GO) digraph. The 
timing and bounding results are detailed in ap- 
pendix A. 

5.1 Comparison 

The clique indexing schema describes mappings 
of data entrie s, i. e., pointers, to rows in tables ref- 
erencing the U6 diagraph. Additional relational 
mechanisms are introduced to materialize the ac- 
tual mapping of the data entries to RID-s (unique 



row locations) in the referencing tables. We com- 
pare the performance of two different approaches: 

1. Use bitmap join indexing in the Oracle 9.2i 
database system to map data entries in the 
clique indexing schema to RID-s in a refer- 
encing table. Static bitmap indexes are es- 
sentially pointers, represented as compressed 
bit arrays, to sets of ordered rows in a ta- 
ble. Bitmap indexes therefore behave nicely 
with respect to logical operations and table 
lookups. These and a few other data ware- 
housing indexing techniques are discussed in 
[3]. 

2. Use multidimensional clustering in DB2 ver- 
sion 8.1.2 to index (using block indexes) 
and cluster a referencing table so that all 
rows referencing the same descendants-and- 
self set are always clustered together. Cur- 
rently this is only supported, in DB2, for 16 
color columns simultaneously so this was ap- 
plied to a subset of the nodes only. We com- 
pare clustering by 1, 2, 4, 8 and 16 color 
columns in the DB2 setup. Multidimensional 
clustering organizes tables according to pre- 
set column dimensions into chunks of data, 
physically located together on one or more 
pages, i.e., blocks, and according to a uniform 
treatment for all the organizing dimensions. 
Clustering of arrays is described in [4]. 

Not too surprisingly, it turns out that multi- 
dimensional clustering performs well when all the 
referencing data is retrieved from a disk storage, 
whereas bitmap join indexing outperforms by a 
huge ratio when a large part of the data is located 
in memory. 

5.2 Conclusions 

Figure 1 in appendix A, demonstrates that 
both technologies, multidimensional clustering 
and bitmap join indexing, are useful in the con- 
text of indexing tables referencing complex struc- 
tures when the indexing is facilitated by the 
clique data entry schema. 
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Explicitly, the timing of the queries shows the 
benefits of using multidimensional clustering in 
DB2 when the chromatic number of the associ- 
ated intersection graph of the referenced struc- 
ture is very small, i.e., 16 or less. Having to hor- 
izontally append the fact table in DB2 with the 
color columns is a cumbersome step though, es- 
pecially if the referenced structure changes with 
time. Another observation is that the size of the 
fact table, in DB2, will grow considerably if many 
color columns are appended to it, for the purpose 
of implementing multidimensional clustering. 

The setup used in Oracle is extremely sim- 
ple and is scalable to high chromatic numbers, 
i.e., many dimensions. It shows superior timing 
results when accessing table data that is already 
located in memory and bounds nicely as dimen- 
sionality increases. 

5.3 Inserts and Deletes 

Adding or deleting rows in the fact table cascades 
to the indexing structures, bitmap or clustering 
in the above. The responsibility to maintain the 
additional indexing structures therefore falls on 
the database management system in this case. 
This is more complicated when the digraph can 
also change, but the updating issues are similar 
to what is seen in data warehousing. If the di- 
graph changes frequently then one should con- 
sider decoupling the clique structure and the fact 
table, in order to not cause to much processing 
during updates. There are many relational index- 
ing techniques compatible with such a decoupling 
and used in data warehousing setups, we do not 
discuss these here. 

6 Additional Applications 

Now we demonstrate that the indexing is applica- 
ble to setups not just involving graph data. Both 
of the applications described may be used to re- 
solve intersection and stabbing queries efficiently 
and are therefore interesting in their own right. 



6.1 Spatial Indexing 

Clique indexing techniques can be used to ef- 
ficiently resolve stabbing and interval queries. 
Starting with a table containing or referencing 
intervals, one may define the set of data entries 
as the set of all endpoints of the intervals. The 
set- value function chosen for stabbing and inter- 
val queries may be selected as 

F(e*) = {Intervals I = (X, Y) in the table 
with Y > e * and X < e*}. 

The purpose of this definition is not necessar- 
ily to materialize the set valued function. It may 
be regarded as a theoretical object, used only 
in the definition of an intersection graph that 
will then be efficiently colored using proper graph 
coloring techniques. This particular intersection 
graph, Int(-F) can be properly colored extremely 
efficiently Essentially, a simple geometric argu- 
ment shows that the graph can be properly col- 
ored simply by ordering the nodes in an increas- 
ing order and then assigning the colors 1 through 
the chromatic number (or higher) sequentially 
and repeatedly to the ordered sequence of nodes. 
This requires an estimate of the chromatic num- 
ber to be available to the system which is an addi- 
tional step in the process. The dimensionality of 
this particular schema can be high, so it may have 
to be replaced with more than one schema, each 
referencing intervals of similar length, to keep the 
accumulated dimensionality low. 

Once the intersection graph Int(F) has been 
properly colored and materialized in a clique in- 
dexing schema, e.g., a table called "Clique_F", 
an intersection query may be formed as follows. 
Assume that the "node" column in "Clique_F" 
references closed interval objects with endpoints: 
nodc.x and node..y, and that [a, b] is an arbitrary 
interval, also assume that N* is the greatest data 
entry smaller than or equal to b and, finally, that 
the data entry N* has been assigned color "n" . 
In this case the intersection query: Select all in- 
tervals that intersect the closed interval [a, b], is 
issued as: 
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- select node from Clique_F where Cn = N* 
union all 

select node from Clique_F where node..y be- 
tween a and b 

Each part of this query can be efficiently evalu- 
ated, for example, by using a bitmap index on the 
color column and a B+Tree index on the node..y 
column. 



6.2 Homology Index 

Above, we created an indexing structure that 
uses endpoints of the intervals as data entries. 
This requires considerable processing when the 
interval data changes. Another approach is to use 
fixed reference intervals and introduce additional 
filtering to root out false positive hits. Consider 
the interval tree shown 
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• 5 • 
10 I 11 



• 6 • 
12 I 13 



• 7 • 
14 I 15 



This binary interval tree enumerates 2™ — 1 in- 
tervals between [0, 1] where n is the number of 
levels chosen, in the case shown we have n = 4. 
Assuming we are interested in evaluating overlap 
queries, it is straightforward to pick the data en- 
tries as the intervals, numbered 1, 2, 3, . . . , 2™ — 1 
and set the set-valued function F(i) as all inter- 
vals that overlap interval number i, e.g., F(5) = 
{1,2, 5, 10, 11}. One quickly realizes that for this 
particular choice we have 



x(Int(F)) 



1. 



Since 1 G F(i) always. This (usually) high chro- 
matic number suggests that we should pick an- 
other function. The function, G, which we will 
use instead is defined as follows. The data entries 
used are the integer pairs (p, q)* from the set: 

{(p, q) : q = 1, 2, . . . , n and p = 1, 2, . . . , 2« - 1}. 

The set-valued function G, defined on this set 
and returning sets of intervals from the interval 



tree, is then determined by: 

k G G(p, q) * if and only if 
(k = p) or (2« < 2p < k < 2™ and [^J = p). 

Here [xj is the floor function, i.e., returns the 
largest integer at most x. The intersection graph 
Int(G) can be properly colored with n colors, 
since G(pi,q) (~l G(p 2 , q) = 9 if p\ ^ pi- The col- 
oring map is given by 

c(p, q)* = q and x(Int(G)) = n. 

This much more reasonable chromatic number (n 
instead of 2" — 1) enables us to materialize the 
clique indexing schema into a compact relation, 
i.e., "clique_G". For example, if n = 20 we obtain 
a table with 1,048,575 rows and 21 columns and 
containing no "NULL" values. 

The mechanism that assigns data entries to 
a given interval, in order to resolve an overlap 
query, in the interval graph is constructed as fol- 
lows. Consider the k-th interval in the tree. It is 
located on level L, determined by L = [\og 2 (k)J + 
1 . Recursively define a sequence R\ , . . . , Rl of 
length L by Rl — k, and then R t ~\ = |Rt/2j, 
note that i?i = 1 . Then the overlapping intervals 
are obtained as the disjoined union G(Rl,L) * 
UG(R L -i,L)*---UG(Ri,L)*. So, referring back 
to the set F(5) = {1, 2, 5, 10, 11}, which we used 
to illustrate the earlier choice with n=4, we now 
obtain 

F(5) = G(5, 3) * UG(2, 3) * UG(1, 3)* 
= {5,10,11}U{2}U{1}. 

In a database system this simply means that the 
intersection query: Find all intervals in the inter- 
val tree that intersect interval number k in the 
tree, is issued as: 

- select node from clique.G where 
cL in (R L ,R L -!, . . . , 1). 

There is no need to store the data entries as 
pairs in the "clique_G" relation since the second 
value is just the color and therefore is the same 
within each column, i.e., it is enough to store the 
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first number "p" in the color columns, instead of 
(p, q)*. This explains why we write "cL" in the 
above query for the L-th color column, instead of 
"cL..p", to obtain the first value only. 

Of course, this structure is also easily ob- 
tained by direct calculations. The advantage of 
materializing this compact schema, "clique.G", 
shown below for n = 4, is that it facilitates fast 
resolution of intersection queries when the inter- 
vals reference the interval tree. 
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Our approach is to use the "clique_G" relation, in 
the Oracle 9.2i database, to build n bitmap join 
indexes, one for each color to index a table that 
references features of the DNA. The indexing is 
then used to resolve queries involving the map- 
ping of feature sets from one species to another. 

Acknowledgments 

This research has been sponsored by deCODE 
Genetics, Inc. in Iceland. 



A Appendix: Testing 



Here we document the testing described in 
section 5. 

A.l Hardware and Data 

The computer used is a 2 processor Intel Pentium 
III 667MHz system each with integrated 256KB 
level 2 cache. It has a 512MB ECC RDRAM 
memory and all the database files are located on 
a WD 200GB, 7200 RPM disk with 8MB cache 
and lOOMB/s ATA interface. The operating sys- 
tem is Windows 2000 Professional SP3. Care was 
taken to make sure that both databases, Oracle 
9.2i and DB2 8.1.2 are optimally tuned and were 
able to take advantage of the limited resources 
available on this machine. The "page size" was 
kept at 4KB in DB2. 

Two tables, one with approximately 1.5 mil- 
lion rows and the other with approximately 3 
million rows reference the gene ontology, Gf3, 
graph in a column called "acc" . The tables con- 
tain a number column also, called "m" , and other 



columns with various values, the referencing ta- 
bles will be referred to as fact tables. 

A. 2 The Gene Ontology Digraph 

The gene ontology acyclic digraph, G(3, is devel- 
oped by the Gene Ontology Consortium, [9]. A 
greedy coloring algorithm may be used to prop- 
erly color its intersection graph Int (£)[•]) with 36 
colors. The clique indexing schema Clique(Z)[-]), 
denoted by cliquc_GO in the queries below, there- 
fore has 37 columns (Node, cl,c2,..., c36) and one 
row for each node in the digraph, i.e., over 11,000 
nodes, following the convention established in 
section 4.2. 

A. 3 Fig. 1: Description and Queries 

Figure 1 shows results from the first two compar- 
isons provided and detailed below. In DB2's case 
we always use the color columns in the clique 
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First Comparison with o = 1 / 24 



Second Comparison with o = 1 / 204 
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Fig. 1. Comparisons with a equal to 1/24 and 1/204 



structure to cluster the fact tables using multi- 
dimensional clustering. In Oracle's case we al- 
ways indexed the fact tables using bitmap join 
indexes derived from the color columns in the 
clique structure. 

Referring back to Figure 1, the vertical axes 
show query execution time in seconds. The lower 
horizontal axes show the dimensionality of the 
system being tested. In DB2's case this means 
the number of color columns used to organize 
the fact tables using multidimensional cluster- 
ing. In the first comparison shown, dimension- 
ality equal to one indicates that the fact tables 
are clustered using color column "c8" only, di- 
mensionality equal to 2 indicates that the clus- 



tering is organized by colors 7 and 8 and so on 
until the fact table is clustered simultaneously by 
all the color columns from 1 to 16 - which is the 
maximum supported by DB2. In the second com- 
parison shown, color columns 9 through 24 were 
used as clustering columns simultaneously. In Or- 
acle's case the dimensionality, i.e., 36, is simply 
the number of color columns in the clique rela- 
tion. The Oracle setup is similar to a star-schema 
with 36 dimensions or dimension levels (indepen- 
dent), in which the clique indexing schema acts 
as a dimension table. 

Each setup is executed both for a smaller 
fact table with approximately 1.5 million rows 
and a larger setup with approximately 3.0 mil- 
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lion rows. The range of execution times for the 
smaller setup is always shown to the left and the 
range of execution times for the larger setup is 
always to the right. The time ranges are broad 
since they vary from the database having to read 
all the data from disk (cold database) to the ex- 
ecution times seen in repeated executions. 

The top horizontal axes show the size of the 
tables in the database, both for the larger and 
the smaller fact table setup. As expected the size 
of the multidimensional clustered table increases 
substantially as it is organized by more and more 
dimensions. On the other hand, the star schema 
setup used in the Oracle database does not in- 
crease the size of the fact table as dimensionality 
increases, but the accumulated size of all the 36 
bitmap join indexes, one for each color column, 
was only about 40MB in the Oracle 9.2i database. 

The value of a shown for each comparison on 
figure 1 is simply the ratio of rows in the fact ta- 
ble touched by the respective queries. So a value 
of a = 1/24 means that about 5% of the rows 
in the fact table satisfy the query, and the other 
value of g — 1/204 indicates that only 0.5% of 
the rows satisfy the second query. More precisely, 
figure 1 shows execution times for the following 
queries: 

First Comparison: For a = 1/24, we select a 
node from the Gene Ontology graph assigned 
code '00:0006810' representing 'transport' (The 
directed movement of substances such as macro- 
molecules, small molecules, ions into, out of, 
within or between cells). The query aggregates 
"m" over all rows that reference this term or one 
of its many descendants in the digraph. As in- 
dicated about 5% of the rows in the fact table 
satisfy this query. The node '00:0006810' was 
assigned color 8 by the proper coloring algorithm 
applied to the intersection graph. 

- In DB2 the query used is of the form: 
select sum(m) from MDC.gotermfact 
where c8 = '00:0006810'. 



The necessary color columns included in the 
multidimensional clustering were added to 
the fact table. 

- In Oracle the query was formed as: 
select sum(f.m) 

from gotermfact f, clique.GO d 
where d.node = f.acc and 

d.c8 = 'GO:0006810'. 
The "clique_GO" table is the clique index- 
ing schema as explained above. Additionally, 
a hint (+index_combine(f)) was inserted into 
the query to make sure that Oracle selected 
to use the available bitmap join indexes. 

Second Comparison: For a = 1/204 we used the 
following nodes 'GO: 0015171' (i.e., amino acid 
transporter activity), 'GO:0015203' (i.e., poly- 
amine transporter activity), 'GO:0015291' (i.e., 
porter activity) and aggregate over all rows in the 
fact table that reference a node which is the de- 
scendant of all these three nodes simultaneously. 
The nodes were assigned colors 13, 21 and 23 
respectively by the coloring of the intersection 
graph. 

- This time, the query used for DB2 is as fol- 
lows: 

select sum(m) 
from MDC_gotermfact_9to24 
where cl3 = 'GO:0015171' and 

c21 = 'GO:0015203' and 
c23 = 'GO:0015291'. 
The table is organized (clustered) by all the 
16 color columns from c9 to c24. 

- In Oracle the same query is formed as: 
select sum(f.m) 

from gotermfact f, clique_GO d 
where d.node = f.acc and 

d.cl3 = '00:0015171' and 
d.c21 = '00:0015203' and 
d.c23 = '00:0015291'. 
Additionally, a hint was inserted to ensure 
the used of bitmap join indexes. 
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